[57] ABSTRACT

A method and apparatus for achieving fault tolerance in a computer
system having at least a first central processing unit and a second
central processing unit. The method comprises the steps of first
executing a first algorithm in the first central processing unit on
input which produces a first output as well as a certification trail.
Next, executing a second algorithm in the second central processing
unit on the input and on at least a portion of the certification trail
which produces a second output. The second algorithm has a faster
execution time than the first algorithm for a given input. Then,
comparing the first and second outputs such that an error result is
produced if the first and second outputs are not the same. The step of
executing a first algorithm and the step of executing a second
algorithm preferably take place over essentially the same time period.

18 Claims, 6 Drawing Sheets 
FIG. 1


Algorithm MINSPAN(G, weight)

Input: Connected graph G = (V,E) where V = {1, ..., n} with edge weights.
Output: Spanning tree of G which has minimum weight.

 1  CHOOSE root ∈ V
 2  FOR ALL u ∈ V, key(u) := ∞ END FOR
 3  h := ∅ ; v := root
 4  WHILE v ≠ "empty" DO
 5    key(v) := -∞
 6    FOR EACH [v,w] ∈ E DO
 7      IF weight([v,w]) < key(w) THEN
 8        key(w) := weight([v,w]); prefer(w) := [v,w]
 9        IF member(w,h) THEN changekey(w,key(w),h)
10        ELSE insert(w,key(w),h) END IF
11      END IF
12    END FOR
13    (v,k) := deletemin(h)
14  END WHILE
15  FOR ALL u ∈ V - {root} OUTPUT(prefer(u)) END FOR
END MINSPAN


FIG. 3 
FIG. 2(e)


FIG. 2(f)
FIG. 4(a)



FIG. 4(b) 


Algorithm HUFFMAN(FREQ)

Input: Sequence of positive integers FREQ = f[1], f[2], ..., f[n]
Output: Pointer to a Huffman tree for the input frequencies

 1  FOR i := 1 to n DO
 2    insert(i, f[i], h)
 3    ptr[i] := allocate()
 4    info[ptr[i]] := (i, f[i])
 5  END FOR
 6  FOR j := n+1 to 2n-1 DO
 7    (item1, key1) := deletemin(h)
 8    (item2, key2) := deletemin(h)
 9    ptr[j] := allocate()
10    info[ptr[j]] := (j, key1 + key2)
11    left[ptr[j]] := ptr[item1]
12    right[ptr[j]] := ptr[item2]
13    insert(j, key1 + key2, h)
14  END FOR
15  OUTPUT(ptr[2n-1])
END HUFFMAN


FIG. 5 
Algorithm CONVEXHULL(S)

Input: Set of points, S, in R^2
Output: Counterclockwise sequence of points in R^2 which define the convex hull of S

 1  Let p1 be the point with the largest x coordinate (and smallest y to break ties)
 2  For each point p (except p1) calculate the slope of the line through p1 and p
 3  Sort the points (except p1) from the smallest slope to the largest. Call them p2, ..., pn
 4  q1 := p1; q2 := p2; q3 := p3; m := 3
 5  FOR k := 4 to n DO
 6    WHILE the angle formed by q(m-1), q(m), p(k) is >= 180 degrees DO m := m - 1 END WHILE
 7    m := m + 1
 8    q(m) := p(k)
 9  END FOR
10  FOR i := 1 to m DO, OUTPUT(q(i)) END FOR
END CONVEXHULL


FIG. 7 
METHOD AND APPARATUS FOR FAULT
TOLERANCE

LICENSES

The United States Government has a paid-up non-exclusive license to
practice the claimed invention herein as per NSF Grant CCR-8910569 and
NASA Grant NSG 1442.

FIELD OF THE INVENTION 

The present invention relates to fault tolerance. More 
specifically, the present invention relates to a first algo- 
rithm that provides a certification trail to a second algo- 
rithm for fault tolerance purposes. 

BACKGROUND OF THE INVENTION 

Traditionally, with respect to fault tolerance, the 
specification of a problem is given and an algorithm to 
solve it is constructed. This algorithm is executed on an 
input and the output is stored. Next, the same algorithm 
is executed again on the same input and the output is 
compared to the earlier output. If the outputs differ then 
an error is indicated, otherwise the output is accepted as 
correct. This software fault tolerance method requires
additional time, so-called time redundancy [Johnson, B.,
Design and analysis of fault tolerant digital systems,
Addison-Wesley, Reading, Mass., 1989; Siewiorek, D.,
and Swarz, R., The theory and practice of reliable design,
Digital Press, Bedford, Mass., 1982]; however, it
requires no additional software. It is particularly valuable
for detecting errors caused by transient fault phenomena.
If such faults cause an error during only one of
the executions then either the error will be detected or
the output will be correct.

A variation of the above method uses two separate
algorithms, one for each execution, which have been
written independently based on the problem specification.
This technique, called N-version programming
[Chen, L., and Avizienis, A., "N-version programming:
a fault tolerant approach to reliability of software operation,"
Digest of the 1978 Fault Tolerant Computing
Symposium, pp. 3-9, IEEE Computer Society Press,
1978; Avizienis, A., "The N-version approach to fault
tolerant software," IEEE Trans. on Software Engineering,
vol. 11, pp. 1491-1501, December, 1985] (in this
case N = 2), allows for the detection of errors caused by
some faults in the software in addition to those caused
by transient hardware faults and utilizes both time and
software redundancy. Errors caused by software faults
are detected whenever the independently written programs
do not generate coincident errors.

SUMMARY OF THE INVENTION 

The present invention pertains to a method for
achieving fault tolerance in a computer system having
at least a first central processing unit and a second
central processing unit. The method comprises the
steps of first executing a first algorithm in the first central
processing unit on input which produces a first
output as well as a certification trail. Next, executing a
second algorithm in the second central processing unit
on the input and on at least a portion of the certification
trail which produces a second output. The second algorithm
has a faster execution time than the first algorithm
for a given input. Then, comparing the first and second
outputs such that an error result is produced if the first
and second outputs are not the same. The step of executing
a first algorithm and the step of executing a second
algorithm preferably take place over essentially the
same time period.

The present invention also pertains to a method for
achieving fault tolerance in a central processing unit.
The method comprises the steps of executing a first
algorithm in the central processing unit on input which
produces a first output as well as a certification trail.
Then, there is the step of executing a second algorithm
in the central processing unit on the input and on at least
a portion of the certification trail which produces a
second output. The second algorithm has a faster execution
time than the first algorithm for a given input.
Then, there is the step of comparing the first and second
outputs such that an error result is produced if the first
and second outputs are not the same.

The present invention also pertains to a computer
system. The computer system comprises a first computer.
The first computer has a first memory. The first
computer also has a first central processing unit in communication
with the memory. The first computer additionally
has a first input port in communication with the
memory and the first central processing unit. There is a
first algorithm disposed in the first memory which produces
a first output as well as a certification trail based
on input received by the input port when it is executed
by the first central processing unit. The computer system is
additionally comprised of a second computer. The second
computer is comprised of a second memory. The
second computer is also comprised of a second central
processing unit in communication with the memory and
the first central processing unit. The second computer
additionally is comprised of a second input port in communication
with the memory and the second central processing
unit. There is a second algorithm disposed in the
second memory which produces a second output based
on the input and on at least a portion of the certification
trail when the second algorithm is executed by the second
central processing unit. The second algorithm has a
faster execution time than the first algorithm for a given
input. The computer system is also comprised of a
mechanism for comparing the first and second outputs
such that an error result is produced if the first and
second outputs are not the same.

Moreover, the present invention also pertains to a
computer. The computer is comprised of a memory.
Additionally, the computer is comprised of a central
processing unit in communication with the memory.
The computer is additionally comprised of a first input
port in communication with the memory and the central
processing unit. There is a first algorithm disposed in
the memory which produces a first output as well as a
certification trail based on input received by the input
port when the first algorithm is executed by the central
processing unit. There is a second algorithm also disposed in
the memory which produces a second output based on
the input and on at least a portion of the certification
trail when the second algorithm is executed by the central
processing unit. The second algorithm has a faster
execution time than the first algorithm for a given input.
Moreover, the computer is comprised of a mechanism
for comparing the first and second outputs such that an
error result is produced if the first and second outputs
are not the same.
BRIEF DESCRIPTION OF THE DRAWINGS

In the accompanying drawings, the preferred embodiments
of the invention and preferred methods of
practicing the invention are illustrated in which:

FIG. 1 is a block diagram of the present invention.

FIGS. 2(a) through 2(f) show an example of a
minimum spanning tree algorithm.

FIG. 3 shows the source code for the MINSPAN algorithm.

FIGS. 4(a) and 4(b) show an example of a data structure
used in the second execution of the MINSPAN algorithm.

FIG. 5 shows the source code for a Huffman algorithm.

FIG. 6 shows an example of a Huffman tree.

FIG. 7 shows the source code for Graham's scan algorithm.

FIGS. 8(a) through 8(c) show a convex hull example.

FIG. 9 is a block diagram of an apparatus of the 
present invention. 

FIG. 10 is a block diagram of another embodiment of 
the present invention. 

FIG. 11 is a block diagram of another embodiment of 
the present invention. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

The central idea of the present invention, essentially a
fault tolerance mechanism, as illustrated in FIG. 1, is to
modify a first algorithm so that it leaves behind a trail of
data which is called a certification trail. This data is
chosen so that it can allow a second algorithm to execute
more quickly and/or have a simpler structure than
the first algorithm. The outputs of the two executions
are compared and are considered correct only if they
agree. Note, however, that care must be taken in defining
this method or else its error detection capability might
be reduced by the introduction of data dependence between
the two algorithm executions. For example, suppose
the first algorithm execution contains an error
which causes an incorrect output and an incorrect trail
of data to be generated. Further suppose that no error
occurs during the execution of the second algorithm. It
still appears possible that the execution of the second
algorithm might use the incorrect trail to generate an
incorrect output which matches the incorrect output
given by the execution of the first algorithm. Intuitively,
the second execution would be "fooled" by the
data left behind by the first execution. The definitions
given below exclude this possibility. They demand that
the second execution either generates a correct answer
or signals the fact that an error has been detected in the
data trail. Finally, it should be noted that in FIG. 1 both
executions can signal an error. These errors would include
run-time errors such as divide-by-zero or non-terminating
computation. In addition, the second execution
can signal an error due to an incorrect certification
trail. The fault tolerance means can be used in hardware
or software systems and manifested as firmware or software
in a central processing unit.

A formal definition of a certification trail is the following.

Definition 2.1. A problem P is formalized as a relation
(that is, a set of ordered pairs). Let D be the domain
(that is, the set of inputs) of the relation P and let S be
the range (that is, the set of solutions) for the problem.
It can be said an algorithm A solves a problem P if for
all d ∈ D when d is input to A then an s ∈ S is output such
that (d,s) ∈ P.

Definition 2.2. Let P : D → S be a problem. Let T be
the set of certification trails. A solution to this problem
using a certification trail consists of two functions F1
and F2 with the following domains and ranges:
F1 : D → S × T and F2 : D × T → S ∪ {error}. The functions must
satisfy the following two properties:

(1) for all d ∈ D there exists s ∈ S and there exists t ∈ T
such that F1(d) = (s,t) and F2(d,t) = s and (d,s) ∈ P

(2) for all d ∈ D and for all t ∈ T either (F2(d,t) = s and
(d,s) ∈ P) or F2(d,t) = error.

The definitions above assure that the error detection
capability of the certification trail approach is comparable
to that obtained with the simple time redundancy
approach discussed earlier. That is, if transient hardware
faults occur during only one of the executions
then either an error will be detected or the output will
be correct. It should be further noted, however, that the
examples to be considered will indicate that this new
approach can also save overall execution time.
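
As an illustration of Definition 2.2, the following is a minimal Python sketch using sorting as the problem. Sorting is not one of the examples treated in this description, and the names first_execution, second_execution and certified_sort are hypothetical. The first execution plays the role of F1 and emits, as its trail, the permutation that sorts the input. The second execution plays the role of F2: it rebuilds the output in linear time and returns "error" unless the trail is a valid permutation that yields a sorted sequence, so a corrupted trail cannot fool it.

```python
def first_execution(data):
    """F1: sort the input and emit the sorting permutation as the trail."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    return [data[i] for i in order], order


def second_execution(data, trail):
    """F2: rebuild the output from the trail in linear time, or report "error"."""
    n = len(data)
    if len(trail) != n:
        return "error"
    seen = [False] * n
    out = []
    for i in trail:
        # Every index must be in range and used exactly once (a permutation).
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return "error"
        seen[i] = True
        out.append(data[i])
    # The permuted data must actually be in sorted order.
    if any(out[j] > out[j + 1] for j in range(n - 1)):
        return "error"
    return out


def certified_sort(data):
    """Run both executions and accept the output only if they agree."""
    first_out, trail = first_execution(data)
    second_out = second_execution(data, trail)
    if second_out == "error" or first_out != second_out:
        return "error"
    return first_out
```

The two checks in second_execution correspond to property (2): for any trail whatsoever, the result is either a correct solution or "error".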

The certification trail approach also allows for the
detection of faults in software. As in N-version programming,
separate teams can write the two algorithms; the specification
now must include precise information describing the
generation and use of the certification trail. Because of
the additional data available to the second execution,
the specifications of the two phases can be very different;
similarly, the two algorithms used to implement the
phases can be very different. This will be illustrated in
the convex hull example to be considered later. Alternatively,
the two algorithms can be very similar, differing
only in data structure manipulations. This will be illustrated
in the minimum spanning tree and Huffman tree
examples to be considered later. When significantly
different algorithms are used it is sometimes possible to
save programming effort by sharing program code.
While this reduces the ability to detect errors in the
software it does not change the ability to detect transient
hardware errors as discussed earlier.

With respect to the above, it has been assumed that
our method is implemented with software; however, it
is clearly possible to implement the certification trail
technique by using dedicated hardware. It is also possible
to generalize the basic two-level hierarchy of the
certification trail approach as illustrated in FIG. 1 to
higher levels.

Examples of the Certification Trail Technique 

In this section, the use of certification trails is illustrated
by means of applications to three well-known
and significant problems in computer science: the minimum
spanning tree problem, the Huffman tree problem,
and the convex hull problem. It should be stressed here
that the certification trail approach is not limited to
these problems. Rather, these algorithms have been
selected only to give illustrations of this technique.

Minimum Spanning Tree Example 

The minimum spanning tree problem has been examined
extensively in the literature and an historical survey
is given in [Graham, R.L., "An efficient algorithm
for determining the convex hull of a planar set", Information
Processing Letters, pp. 132-133, 1, 1972]. The
certification trail approach is applied to a variant of the
Prim/Dijkstra algorithm [Prim, R.C., "Shortest connection
networks and some generalizations," Bell Syst.
Tech. J., pp. 1389-1401, November, 1957; Dijkstra, E.W.,
"A note on two problems in connexion with
graphs," Numer. Math. 1, pp. 269-271, 1959] as
explicated in [Tarjan, R.E., Data Structures and Network
Algorithms, Society for Industrial and Applied
Mathematics, Philadelphia, Pa., 1983]. The discussion of
the application of the certification trail approach to the
minimum spanning tree problem begins with some preliminary
definitions.

5,243,607

Definition 3.1. A graph G = (V,E) consists of a vertex
set V and an edge set E. An edge is an unordered
pair of distinct vertices which is notated as, for example,
[v,w], and it is said v is adjacent to w. A path in a graph
from v1 to vk is a sequence of vertices v1, v2, ..., vk such
that [vi, vi+1] is an edge for i ∈ {1, ..., k - 1}. A path
is a cycle if k > 1 and v1 = vk. An acyclic graph is a
graph which contains no cycles. A connected graph is a
graph such that for all pairs of vertices v,w there is a
path from v to w. A tree is an acyclic and connected
graph.
In our case, two different data structure
methods are used to support these operations. One method will
be used in the first execution of the algorithm and another,
faster and simpler, method will be used in the
second execution. The second method relies on a trail of
data which is output by the first execution.

MINSPAN ALGORITHM 

Before discussing precise implementation details for
these methods the overall algorithm used in both executions
is presented. Pidgin code for this algorithm appears
below. In addition, FIG. 2 illustrates the execution
of the algorithm on a sample graph and the table
below records the data structure operations the algorithm
must perform when run on the sample graph. The
first column of the table gives the operations, with
member omitted and the parameter h dropped to reduce clutter.
The second column gives the evolving contents of h.
The third column records the ordered pair deleted by
the deletemin operation.

Definition 3.2. Let G = (V,E) be a graph and let w be
a positive rational valued function defined on E. A
subtree of G is a tree, T(V',E'), with V' ⊆ V and E' ⊆ E.
It is said T spans V' and V' is spanned by T. If V' = V
then we say T is a spanning tree of G. The weight of
this tree is the sum of w(e) over all e ∈ E'. A minimum
spanning tree is a spanning tree of minimum weight.

Data Structures and Supported Operations

Before discussion of the minimum spanning tree algorithm,
the properties required of the principal data structure
must be described. Since many
different data structures can be used to implement the
algorithm, the data that can be stored by the data structure
and the operations that can be used to manipulate this data
are first described abstractly. The data
consists of a set of ordered pairs. The first element in
these ordered pairs is referred to as the item number and
the second element is called the key value. Ordered
pairs may be added to and removed from the set; however,
at all times, the item numbers of distinct ordered pairs
must be distinct. It is possible, though, for multiple
ordered pairs to have the same key value. In this paper
the item numbers are integers between 1 and n, inclusive.
Our default convention is that i is an item number,
k is a key value and h is a set of ordered pairs. A total
ordering on the pairs of a set can be defined lexicographically
as follows: (i,k) < (i',k') iff k < k' or (k = k'
and i < i'). The data structure should support a subset
of the following operations.

member (i,h) returns a boolean value of true if h contains
an ordered pair with item number i, otherwise
returns false.

insert (i,k,h) adds the ordered pair (i,k) to the set h.

delete (i,h) deletes the unique ordered pair with item
number i from h.

changekey (i,k,h) is executed only when there is an
ordered pair with item number i in h. This pair is
replaced by (i,k).

deletemin (h) returns the ordered pair which is smallest
according to the total order defined above and deletes
this pair. If h is the empty set then the token
"empty" is returned.

predecessor (i,h) returns the item number of the ordered
pair which immediately precedes the pair with item
number i in the total order. If there is no predecessor
then the token "smallest" is returned.

Many different types and combinations of data structures
can be used to support these operations efficiently.
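
The abstract operations listed above can be sketched in Python as follows. This is a reference sketch only: the class name PairSet is hypothetical, and a plain sorted list is used for brevity, so operations cost O(n) rather than the logarithmic time of a balanced tree. Pairs are stored internally as (key, item) tuples so that Python's tuple comparison coincides with the total order defined above.

```python
import bisect


class PairSet:
    """Set h of ordered pairs (item, key), kept sorted by key, then item."""

    def __init__(self):
        self._sorted = []  # (key, item) tuples in the total order
        self._key = {}     # item number -> current key value

    def member(self, i):
        return i in self._key

    def insert(self, i, k):
        if i in self._key:
            raise ValueError("item numbers must be distinct")
        bisect.insort(self._sorted, (k, i))
        self._key[i] = k

    def delete(self, i):
        k = self._key.pop(i)
        del self._sorted[bisect.bisect_left(self._sorted, (k, i))]

    def changekey(self, i, k):
        # Replace the pair with item number i by (i, k).
        self.delete(i)
        self.insert(i, k)

    def deletemin(self):
        if not self._sorted:
            return "empty"
        k, i = self._sorted.pop(0)
        del self._key[i]
        return (i, k)

    def predecessor(self, i):
        pos = bisect.bisect_left(self._sorted, (self._key[i], i))
        return "smallest" if pos == 0 else self._sorted[pos - 1][1]
```

For example, after insert(2, 200) and insert(6, 500), predecessor(6) returns 2 and deletemin() returns (2, 200).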
The fourth column records the certification trail corresponding
to these operations and is further discussed below.

The algorithm uses a "greedy" method to "grow" a
minimum spanning tree. The algorithm starts by choosing
an arbitrary vertex from which to grow the tree.
During each iteration of the algorithm a new edge is
added to the tree being constructed. Thus, the set of
vertices spanned by the tree increases by exactly one
vertex for each iteration. The edge which is added to
the tree is the one with the smallest weight. FIG. 2
shows this process in action. FIG. 2(a) shows the input
graph, FIGS. 2(b) through 2(e) show several stages of
the tree growth and FIG. 2(f) shows the final output of
the minimum spanning tree. The solid edges in FIGS.
2(b) through 2(e) represent the current tree and the
dotted edges represent candidates for addition to the
tree.

To efficiently find the edge to add to the current tree
the algorithm uses the data structure operations described
above. As soon as a vertex, say v, is adjacent to
some vertex which is currently spanned it is inserted in
the set h. The key value for v is the weight of the minimum
edge between v and some vertex spanned by the
current tree. The array element prefer (v) is used to
keep track of this minimum weight edge. As the tree
grows, information is updated by operations such as
insert (i,k,h) and changekey (i,k,h).

TABLE 1

Data structure operations and certification trail for MINSPAN

Operation          Set of Ordered Pairs         Delete     Trail

insert(2,200)      (2,200)                                 smallest
insert(6,500)      (2,200),(6,500)                         2
deletemin          (6,500)                      (2,200)
insert(3,800)      (6,500),(3,800)                         6
changekey(6,450)   (6,450),(3,800)                         smallest
insert(7,505)      (6,450),(7,505),(3,800)                 6
deletemin          (7,505),(3,800)              (6,450)
insert(5,250)      (5,250),(7,505),(3,800)                 smallest
changekey(7,495)   (5,250),(7,495),(3,800)                 5
deletemin          (7,495),(3,800)              (5,250)
changekey(3,350)   (3,350),(7,495)                         smallest
insert(4,700)      (3,350),(7,495),(4,700)                 7
deletemin          (7,495),(4,700)              (3,350)
changekey(4,650)   (7,495),(4,650)                         7
deletemin          (4,650)                      (7,495)
deletemin                                       (4,650)
deletemin                                       empty



The deletemin (h) operation is used to select the next
vertex to add to the span of the current tree. Note, the
algorithm does not explicitly keep a set of edges representing
the current tree. Implicitly, however, if (v,k) is
returned by deletemin then prefer (v) is added to the
current tree.

In the first execution of the MINSPAN algorithm,
the MINSPAN code is used and the principal data
structure is implemented with a balanced tree such as an
AVL tree [Adel'son-Vel'skii, G.M., and Landis, E.M.,
"An algorithm for the organization of information",
Soviet Math. Dokl., pp. 1259-1262, 3, 1962], a red-black
tree [Guibas, L.J., and Sedgewick, R., "A dichromatic
framework for balanced trees", Proceedings of the
Nineteenth Annual Symposium on Foundations of
Computing, pp. 8-21, IEEE Computer Society Press,
1978] or a B-tree [Bayer, R., and McCreight, E., "Organization
of large ordered indexes", Acta Inform., pp.
173-189, 1, 1972]. In addition, an array of pointers indexed
from 1 to n is used. The balanced search tree
stores the ordered pairs in h and is based on the total
order described earlier. The array of pointers is initially
all nil. For each item i, the ith pointer of the array is
used to point to the location of the ordered pair with
item number i in the balanced search tree. If there is no
such ordered pair in the tree then the ith pointer is nil.
This array allows rapid execution of operations such as
member (i,h) and delete (i,h).

The certification trail is generated during the first
execution as follows: When CHOOSE root ∈ V is executed
in the first step, the vertex which is chosen is
output. Also, each time insert (i,k,h) or changekey
(i,k,h) is executed, predecessor (i,h) is executed afterwards,
and the answer returned is output. This is illustrated
in the column labeled "Trail" in the table above.
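
The first execution and its trail generation can be sketched in Python as below. This is a sketch under stated assumptions, not the patented implementation: the function name minspan_first is hypothetical, a sorted Python list stands in for the balanced tree (so each operation costs O(n) here instead of O(log n)), and SAMPLE is a hypothetical 7-vertex graph chosen to be consistent with the operations listed in the table for MINSPAN, since the actual graph of FIG. 2 is not reproduced in the text.

```python
import bisect
import math


def minspan_first(n, edges, root):
    """First execution: return (tree edges, certification trail)."""
    adj = {u: [] for u in range(1, n + 1)}
    for (u, v), wt in edges.items():
        adj[u].append((v, wt))
        adj[v].append((u, wt))
    key = {u: math.inf for u in adj}
    prefer = {}
    h = []          # sorted list of (key, item): stand-in for the balanced tree
    trail = [root]  # the chosen root is the first trail element

    def insert(i, k):
        pos = bisect.bisect_left(h, (k, i))
        h.insert(pos, (k, i))
        # predecessor(i, h) is evaluated right after the insert and recorded.
        trail.append("smallest" if pos == 0 else h[pos - 1][1])

    v = root
    while v != "empty":
        key[v] = -math.inf
        for w, wt in adj[v]:
            if wt < key[w]:
                if key[w] != math.inf:     # member(w, h): changekey = delete + insert
                    h.remove((key[w], w))
                key[w] = wt
                prefer[w] = (v, w)
                insert(w, wt)
        v = h.pop(0)[1] if h else "empty"  # deletemin(h)
    tree = [prefer[u] for u in range(1, n + 1) if u != root]
    return tree, trail


# Hypothetical sample graph: (u, v) -> weight.
SAMPLE = {(1, 2): 200, (1, 6): 500, (2, 3): 800, (2, 6): 450, (2, 7): 505,
          (6, 5): 250, (6, 7): 495, (5, 3): 350, (5, 4): 700, (3, 4): 650}
```

Calling minspan_first(7, SAMPLE, 1) yields a trail that begins with the root 1 followed by one predecessor value per insert or changekey, in the order the operations are performed.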

The second execution of the MINSPAN algorithm
also uses the MINSPAN code; however, the CHOOSE
construct and the data structure operations are implemented
differently than in the first execution. The
CHOOSE is performed by simply reading the first element
of the certification trail. This guarantees the same
choice of a starting vertex is made in both executions.
FIG. 4 depicts the principal data structure used which is
called an indexed linked list. The array is indexed from
1 to n and contains pointers to a singly linked list which
represents the current contents of h from smallest to
largest. The ith element of the array points to the node
containing the ordered pair with the item number i if it
is present in h; otherwise, the pointer is nil. The 0th
element of the array points to the node containing (0,
-INF). Initially, the array contains nil pointers except
the 0th element. In order to implement the data structure
operations, the following is provided.

To perform insert (i,k,h), it is necessary to read the
next value in the certification trail. This value, say j, is
the item number of the ordered pair which is the predecessor
of (i,k) in the current contents of h. A new linked
list node is allocated and the trail information is used to
insert the node into the data structure. Specifically, the
jth array pointer is traversed to a node in the linked list,
say Y. (If j = "smallest" then the 0th array pointer is
traversed.) The new node is inserted in the list just after
node Y and before the next node in the linked list (if
there is one). The data field in the new node is set to (i,k)
and the ith pointer of the array is set to point to the new
node. FIG. 4 shows the insertion of (7,505) into the data
structure given that the certification trail value is 6.
FIG. 4(a) is before the insertion and FIG. 4(b) is after
the insertion.

When the insert operation is performed, some checks
must be conducted. First, the ith array pointer must be
nil before the operation is performed. Second, the
sorted order of the pairs stored in the linked list must be
preserved after the operation. That is, if (i',k') is stored
in the node before (i,k) in the linked list and (i'',k'') is
stored after (i,k), then (i',k') < (i,k) < (i'',k'') must hold
in the total order. If either of these checks fails then
execution halts and "error" is output.

To perform delete (i,h) the ith array pointer is traversed
and the node found is deleted from the linked
list. Next, the ith array pointer is set to nil. FIG. 4 shows
the deletion of item number 7 if one considers FIG. 4(b)
as depicting the data structure before the operation and
FIG. 4(a) depicting it afterwards. When the delete operation
is performed one check is made. If the ith array
pointer is nil before the operation then the execution
halts and "error" is output.

To perform changekey (i,k,h) it suffices to perform
delete (i,h) followed by insert (i,k,h). Note, this means
the next item in the certification trail is read. Also, the
checks associated with both of these operations are
performed and the execution halts with "error" output
if any check fails.

To perform deletemin (h) the 0th array pointer is
traversed to the head of the list and the next node in the
list is accessed. If there is no such node then "empty" is
returned and the operation is complete. Otherwise,
suppose the node is Y and suppose it contains the ordered
pair (i,k), then the node Y is deleted from the list,
the ith array pointer is set to nil, and (i,k) is returned.

Lastly, to perform member (i,h) the ith array pointer
is examined. If it is nil then false is returned, otherwise,
true is returned. The predecessor (i,h) operation is not
used in the second execution.
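
The second-execution data structure and its checks can be sketched in Python as follows. The names IndexedLinkedList and CertificationError are hypothetical, and an exception stands in for "execution halts and error is output". One deliberate deviation is labeled here: the sketch keeps a prev link in each node so that delete runs in constant time, whereas the text describes a singly linked list. The trail value j is passed explicitly to insert and changekey rather than read from a stream.

```python
class CertificationError(Exception):
    """Stands in for: execution halts and "error" is output."""


class Node:
    def __init__(self, pair):
        self.pair = pair   # (item, key)
        self.prev = None   # assumption: back link, not in the text's singly linked list
        self.next = None


def precedes(a, b):
    """Total order on pairs: compare key first, then item number."""
    return (a[1], a[0]) < (b[1], b[0])


class IndexedLinkedList:
    def __init__(self, n):
        self.ptr = [None] * (n + 1)
        self.ptr[0] = Node((0, float("-inf")))  # head node holding (0, -INF)

    def member(self, i):
        return self.ptr[i] is not None

    def insert(self, i, k, j):
        """Insert (i, k) after the node of item j, the value read from the trail."""
        if j == "smallest":
            j = 0
        if not isinstance(j, int) or not 0 <= j < len(self.ptr):
            raise CertificationError("trail value out of range")
        y = self.ptr[j]
        if y is None or self.ptr[i] is not None:
            raise CertificationError("predecessor absent or item already present")
        node = Node((i, k))
        if not precedes(y.pair, node.pair) or (
                y.next is not None and not precedes(node.pair, y.next.pair)):
            raise CertificationError("sorted order would be violated")
        node.prev, node.next = y, y.next
        if y.next is not None:
            y.next.prev = node
        y.next = node
        self.ptr[i] = node

    def delete(self, i):
        node = self.ptr[i]
        if node is None:
            raise CertificationError("item not present")
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        self.ptr[i] = None

    def changekey(self, i, k, j):
        self.delete(i)
        self.insert(i, k, j)

    def deletemin(self):
        node = self.ptr[0].next
        if node is None:
            return "empty"
        self.delete(node.pair[0])
        return node.pair
```

Every operation touches a constant number of nodes, which is what allows the second execution to run in time linear in the number of operations. An incorrect trail value either points at an absent node or places the new pair out of order, and both cases raise CertificationError.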

This completes the description of the second execution.
Showing that what has been described is a correct implementation
of the certification trail method requires a
proof. The proof has several parts of varying difficulty.
First, one must show that if the first execution is fault-free
then it outputs a minimum spanning tree. Second,
one must show that if the first and second executions are
fault-free then they both output the same minimum
spanning tree. Neither of these parts of the proof is
difficult to show.

The third, more subtle, part of the proof deals with the
situation in which only the second execution is fault-free.
This means an incorrect certification trail may be
generated in the first execution. In this case, it must be
shown that the second execution outputs either the
correct minimum spanning tree or "error". The checks
that were described ensure this property by detecting any errors
that would prevent the execution from generating
the correct output.

In the first execution each data structure operation
can be performed in O(log(n)) time where |V| = n.
There are at most O(m) such operations and O(m) additional
time overhead where |E| = m. Thus, the first
execution can be performed in O(m log(n)). It is noted
that this algorithm does not achieve the fastest known
asymptotic time complexity which appears in Gabow,
H.N., Galil, Z., Spencer, T., and Tarjan, R.E., "Efficient
algorithms for finding minimum spanning trees in
undirected and directed graphs," Combinatorica 6, pp.
109-122, 2, 1986. However, the algorithm presented
here has a significantly smaller constant of proportionality
which makes it competitive for reasonably sized
graphs. In addition, it provides us with a relatively
simple and illustrative example of the use of a certification
trail.

In the second execution each data structure operation
can be performed in O(1). There are still at most O(m)
such operations and O(m) additional time overhead.
Hence, the second execution can be performed in O(m)
time. In other words, because of the availability of the
certification trail, the second execution is performed in
linear time. There are no known O(m) time algorithms
for the minimum spanning tree problem. Komlós [26]
was able to show that O(m) comparisons suffice to find
the minimum spanning tree. However, there is no
known O(m) time algorithm to actually find and perform
these comparisons. Even the related "verification"
problem has no known linear time solution. In the verification
problem the input consists of an edge weighted
graph and a subtree. The output is "yes" if the subtree
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also uses the command allocate to construct the tree 
This command allocates a new node and returns a 
pointer to it. Each node is able to store an item number 
and a key value in the field called info, the item numbers 
are in the set (I, ... f 2n — 1) and the key values are 
sums of frequency values. The nodes also contain fields 
for left and right pointers since the tree being con- 
structed is binary. 

The Huffman tree is built from the bottom up and the 
overall structure of the algorithm is based on the greedy 
“merging” of subtrees. An array of pointers called ptr is 
used to point to the subtrees as they are constructed 
Initially, n single vertex subtrees with the smallest asso- 
ciated frequency values. To perform a merge a new 
subtree is created by first allocating a new root node 
and next setting the left and right pointers to the two 
subtrees being merged. The frequency associated with 
the new subtree is the sum of the frequencies of the two 
subtrees being merged. In FIO. 6 the frequency associ- 
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the root vertex of the subtree. Details of the algorithm 
are given below. Note that the priority queue data 
structure allows the algorithm to quickly determine 
which subtrees should be merged by enabling the two 
smallest frequency values to be found efficiently during 
each iteration. 

Table 1 below illustrates the data structure operations 
performed when the Huffman tree in FIG. 6 is con- 
structed. For conciseness the initial n inset operations 
have been omitted. The first column gives the set of 
ordered pairs in h. The second column gives the result- 
of the two deletemin operations during each iteration. 
Note that this column is labeled “Trail” because it is 
also output as the certification trail The third column 
records the elements which are inserted by the com- 
mand on line 13. 

TABLE 2 


best known algorithm for this problem was created by 
Tarjan [Tarjan, R.E., “Applications of path compres- 
sion on balanced trees” J. ACM, pp. 690-715, October, 
1979] and has the nonlinear time complexity of Of- 
ma(m,n)), where a(m,n) is a functional inverse of Ack- 25 
erman's function. The fact that the data in a certification 
trail enables a minimum spanning tree to be found in 
linear time is, we believe, intriguing, significant, and 
indicative of the great promise of the certification trail 
technique. 30 

Huffman Tree Example 

Huffman trees represent another classic algorithmic 
problem, one of the original solutions being attributed 
to Huffman [Huffman, D,, “A method for the construe- 35 
tion of minimum redundancy codes”, Proc. IRE, pp. 
1098-1 101, 40, 1952]. This solution has been used exten- 
sively to perform data compression through the design 
and use of so-called Huffman codes. These codes are 
prefix codes which are based on the Huffman tree and 40 
which yield excellent data compression ratios. The tree 
structure and the code design are based on the frequen- 
cies of individual characters in the data to be com- 
pressed. See Huffman, D., “A method for the construc- 
tion of minimum redundancy codes”, Proc. IRE, pp. 45 
1098-1101, 40, 1952, for information about the coding 
application. 

Definition 3.3. The Huffman tree problem is the fol- 
lowing: Given a sequence of frequencies (positive inte- 
gers) fit], f[2 ], . . . , fjn] t construct a tree with n leaves 50 
and with one frequency value assigned to each leaf so 
that the weighted path length is minimized. Specifi- 
cally, the tree should minimize the following sum: 2/* 
ZX4/len(i)fIi] where LEAF is the set of leaves, len(i) is 
the length of the path from the root of the tree to the 55 
leaf Lfii] is the frequency assigned to the leaf I,. 

An example of a Huffman tree is given in FIG. 6. The 
input frequencies are: fi[l) - 35, fl(2) = 20, fl(3) = 44, 
f(4) = 77, ftS) = 23, f[6) m 38, and f(7) - 88. The 
frequencies appear inside the leaf nodes as the second 60 
elements of the ordered pairs in the figure. 

HUFFMAN ALGORITHM 

The algorithm to construct the Huffman tree uses a 
data structure which is able to implement the insert and 65 
the deletemin operations which are defined above in the 
minimum spanning tree example. This type of data 
structure is often called a priority queue. The algorithm 
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First Execution of HUFFMAN 

In this execution the code entitled HUFFMAN is 
used and the priority queue data structure is imple- 
mented with a heap [Tarjan, R.E., Data Structures and 
Network Algorithms, Society for Industrial and Ap- 
plied Mathematics, Philadelphia, Pa. 1983] or a bal- 
anced search tree [Guibas, LJ., and Sedge wick, R., M A 
dichromatic framework for balanced trees”, Proceed- 
ings of the Nineteenth Annual Symposium on Founda- 
tions of Computing, pp. 8-21, IEEE computer Society 
Press, 1978; Adel 4 son-Vd-VeP*kii, G.M., and Landis, 
E.M., “An algorithm for the organization of informa- 
tion” Soviet Math. Doki, pp. 1259-1262, 3, 1962; 
Bayer, R. f and McCreight, E., “Organization of large 
ordered indexes”, Acta Inform., pp. 173-189, 1, 1972]. 
Actually, any correct implementation is acceptable; 
however, to achieve a reasonable time complexity for 
this execution the suggested implementation are desir- 
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able, the certification trail is generated as follows: 
whenever deletemin (h) is executed the item number 
and the key vilue which are returned are both output. 

In the table, the certification trail is listed in the second 
column. 
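
The trail generation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HUFFMAN code of the figures: the function name `huffman_first_execution`, the use of Python's `heapq` module as the priority queue, the dictionary used for the tree, and the numbering of merged nodes as n+1, ..., 2n-1 in order of creation are all choices made here for the example.

```python
import heapq


def huffman_first_execution(freqs):
    """Heap-based run of the Huffman construction.

    Returns (root, tree, trail). tree maps each internal item number to
    its (left, right) children; trail lists every deletemin result, as
    (item number, key value) pairs, in the order they were produced.
    """
    n = len(freqs)
    # Items 1..n are the leaves. Heap entries are (key, item) so that the
    # smallest key is removed first (ties are broken by item number).
    heap = [(key, item) for item, key in enumerate(freqs, start=1)]
    heapq.heapify(heap)
    tree, trail = {}, []

    def deletemin():
        key, item = heapq.heappop(heap)
        trail.append((item, key))  # certification trail entry
        return item, key

    for new_item in range(n + 1, 2 * n):  # internal nodes n+1 .. 2n-1
        left, k1 = deletemin()
        right, k2 = deletemin()
        tree[new_item] = (left, right)
        heapq.heappush(heap, (k1 + k2, new_item))
    root, _ = deletemin()  # the last remaining element is the root
    return root, tree, trail
```

For the input frequencies 35, 20, 44, 77, 23, 38, 88 given above, the first two trail entries produced by this sketch are (2, 20) and (5, 23), the two smallest frequencies. The item numbers it assigns to merged nodes may differ from those shown in FIG. 6.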

Second Execution of HUFFMAN 

This execution consists of two parts which may be
logically separated but which are performed together.
In the first logical part, the code called HUFFMAN is
executed again except that the data structure operations
are treated differently. No insert operations are
performed, and all deletemin operations are performed
by simply reading the ordered pairs from the
certification trail. In the second logical part, the data
structure operations are “verified”. Note that “verify” here
does not mean a formal proof of correctness based on the
text of an algorithm. The problem of verification can be
formulated as follows: given a sequence of insert(i,k,h)
and deletemin(h) operations, check to see if
the answers are correct. It should be noted that while in
our example there is only one h, in general there can be
multiple h's to be handled.
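
A minimal Python sketch of such a second execution follows. It is an illustration under assumptions rather than the patent's code: the name `huffman_second_execution` is hypothetical, the trail is assumed to be a list of (item number, key value) pairs, and the contents of h are tracked in an array indexed by item number in which -1 is an arbitrary marker for "not in h". A failed check is reported by raising `ValueError("error")`.

```python
def huffman_second_execution(freqs, trail):
    """Rebuild the Huffman tree from the trail, checking every answer.

    Returns (root, tree), or raises ValueError("error") if a check fails.
    """
    n = len(freqs)
    ABSENT = -1  # arbitrary indicator: no pair with this item is in h
    slot = [ABSENT] * (2 * n)  # slot[i] == k means pair (i, k) is in h
    entries = iter(trail)
    last_key = 0  # frequencies are positive, so 0 is below every key

    def fail():
        raise ValueError("error")

    def insert(item, key):
        # Not performed on a real priority queue, only recorded.
        if slot[item] != ABSENT:
            fail()
        slot[item] = key

    def deletemin():
        nonlocal last_key
        entry = next(entries, None)
        if entry is None:  # trail ended too early
            fail()
        item, key = entry
        in_range = isinstance(item, int) and 1 <= item < 2 * n
        # The claimed pair must be in h, and keys must never decrease.
        if not in_range or slot[item] != key or key < last_key:
            fail()
        slot[item] = ABSENT
        last_key = key
        return item, key

    for item, key in enumerate(freqs, start=1):
        insert(item, key)
    tree = {}
    for new_item in range(n + 1, 2 * n):
        left, k1 = deletemin()
        right, k2 = deletemin()
        tree[new_item] = (left, right)
        insert(new_item, k1 + k2)
    root, _ = deletemin()
    return root, tree
```

Every operation in this sketch is a constant number of array reads and writes, which is where the speed advantage over a heap-based run comes from. A comparator would then compare the returned tree with the tree from the first execution.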

The description of the algorithm for the second
execution [...] restricted types of operation sequences are
generated by the HUFFMAN code. First, it can be observed
that all elements are ultimately deleted from h before the
algorithm terminates; second, it can be further observed
that when an element is inserted into h, its key value is
larger than the key value of the last element deleted
from h. These two important observations allow us to
check a sequence using the simplified method which is
described next.

Our simplified method uses an array of integers
indexed from 1 to 2n - 1. This array is used to track the
contents of h. If the ordered pair (i,k) is in h, then array
element i is set to a value of k; and if no ordered pair
with item number i is in h, then array element i is set to
a value of -1. Initially, all array elements are set to -1
and then the operation sequence is processed. If
insert(i,k,h) is executed then array element i is checked to
see if it contains -1. (The value of -1 is an arbitrary
selection meant to serve only as an indicator.) If array
element i does contain -1, then it is set to k. If
deletemin(h) is executed, then the answer indicated by the
certification trail, say (i,k), is examined. Array element i
is checked to see if it contains k. In addition, k is
compared to the key value of the previous element in the
certification trail sequence to see if it is greater than or
equal to that previous value. If both these checks succeed
then array element i is set to -1.

If any of the checks just described above fails, then
the execution halts and “error” is output. Otherwise the
operation sequence is considered “verified”. It can be
rigorously shown that the checks described are
sufficient for determining whether the answers given in the
certification trail are correct; this proof, however, has
been omitted for the sake of brevity. Finally, it is worth
noting that to combine the two logical parts of this
execution, one can perform the data structure checking
in tandem with the code execution of HUFFMAN.
Each time an insert or deletemin is encountered in the
code, the appropriate set of checks is performed.

Time Complexity Comparison of the Two Executions

Again, as in the minimum spanning tree example, the
availability of the certification trail permits the second
execution for the Huffman tree problem to be
dramatically more efficient than the first.

In the first execution of HUFFMAN, each data
structure operation can be performed in O(log(n)) time
where n is the number of frequencies in the input. There
are O(n) such operations and O(n) additional time
overhead; hence, the execution can be performed in
O(n log(n)) time. This is the same complexity as the best
known algorithm for constructing Huffman trees.

In the second execution of HUFFMAN, each
data structure operation is performed in constant time.
Further, verifying that the data structure operations are
correct takes only a constant time per operation. Thus,
it follows that the overall complexity of the second
execution is only O(n).

Convex Hull Example

The convex hull problem is fundamental in
computational geometry. The certification trail solution to
the generation of a convex hull is based on a solution due
to Graham [Graham, R.L., “An efficient algorithm for
determining the convex hull of a planar set”,
Information Processing Letters, pp. 132-133, 1, 1972] which is
called “Graham's Scan.” (For basic definitions and [...]
Preparata and Shamos [Preparata, F.P., and Shamos,
M.I., Computational geometry: an introduction,
Springer-Verlag, New York, N.Y., 1985].) For simplicity in
the discussion which follows, it is assumed the points
are in so-called “general position” (that is, no three
points are collinear). It is not difficult to remove this
restriction.

Definition 3.4. A convex region in R^2 is a set of
points, say Q, in R^2 such that for every pair of points in
Q the line segment connecting the points lies entirely
within Q. A polygon is a circularly ordered set of line
segments such that each line segment shares one of its
endpoints with the preceding line segment and shares
the other endpoint with the succeeding line segment in
the ordering. The shared endpoints are called the
vertices of the polygon. A polygon may also be specified by
an ordering of its vertices. A convex polygon is a
polygon which is the boundary of some convex region. The
convex hull of a set of points, S, in the Euclidean plane
is defined as the smallest convex polygon enclosing all
the points. This polygon is unique and its vertices are a
subset of the points in S. It is specified by a
counterclockwise sequence of its vertices.

FIG. 8(c) shows a convex hull for the points indicated
by black dots. Graham's scan algorithm given below
constructs the convex hull incrementally in a
counterclockwise fashion. Sometimes it is necessary for the
algorithm to “back up” the construction by throwing
some vertices out and then continuing. The first step of
the algorithm selects an “extreme” point and calls it p_1.
The next two steps sort the remaining points in a way
which is depicted in FIG. 8(a). It is not hard to show
that after these three steps the points when taken in
order, p_1, p_2, ..., p_n, form a simple polygon; although,
in general, this polygon is not convex.

Graham's Scan Algorithm 

It is possible to think of Graham's scan algorithm as
removing points from this simple polygon until it
becomes convex. The main FOR loop iteration adds
vertices to the polygon under construction and the inner
WHILE loop removes vertices from the construction.
A point is removed when the angle test performed at
Step 6 reveals that it is not on the convex hull because
it falls within the triangle defined by three other points.
A “snapshot” of the algorithm given in FIG. 8(b) shows
that q_5 is removed from the hull. The angle formed by
q_4, q_5, p_k is less than 180 degrees. This means q_5 lies
within the triangle formed by q_4, p_1, p_k. (Note, q_1 = p_1.)
In general, when the angle test is performed, if the angle
formed by q_{m-1}, q_m, p_k is less than 180 degrees, then
q_m lies within the triangle formed by q_{m-1}, p_1, p_k.
Below it will be revealed that this is the primary
information relied on in our certification trail. When the
main FOR loop is complete, the convex hull has been
constructed.

First Execution of Graham's Scan 

In this execution the code CONVEXHULL is used.
The certification trail is generated by adding an output
statement within the WHILE loop. Specifically, if an
angle of less than 180 degrees is found in the WHILE
loop test then the four tuple consisting of
q_m, q_{m-1}, p_1, p_k is output to the certification trail.
Table 3 below shows the four tuples of points that
would be output by the algorithm when run on the
example in FIG. 8. The points in Table 3 are given the
same names as in FIG. 8. The final convex hull points
q_1, ..., q_m are also output to the certification trail.
Strictly speaking the trail output does not consist of the
actual points in R^2. Instead, it consists of indices to the
original input data. This means if the original data
consists of s_1, s_2, ..., s_n then rather than output the
element in R^2 corresponding to s_i, the number i is output.
It is not hard to code the program so that this is done.
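
The following Python sketch shows one way such a trail could be produced. It is not the CONVEXHULL code of the figures: the names `cross` and `graham_scan_with_trail` are hypothetical, indices are 0-based, the extreme point is taken to be the lowest point (leftmost on ties), the sort uses `math.atan2`, and the angle test is expressed as a sign test on a cross product. The points are assumed to be in general position and to number at least three.

```python
import math


def cross(o, a, b):
    """Twice the signed area of triangle o, a, b (> 0 for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def graham_scan_with_trail(points):
    """Return (hull, trail) as indices into points.

    hull is in counterclockwise order. Each trail entry is the 4-tuple
    (removed point q_m, q_{m-1}, p_1, p_k) recorded when q_m is removed.
    """
    n = len(points)
    # Extreme point p_1: lowest y coordinate, then lowest x coordinate.
    p1 = min(range(n), key=lambda i: (points[i][1], points[i][0]))
    # Sort the remaining points by polar angle around p_1.
    rest = sorted(
        (i for i in range(n) if i != p1),
        key=lambda i: math.atan2(points[i][1] - points[p1][1],
                                 points[i][0] - points[p1][0]),
    )
    hull, trail = [p1, rest[0]], []
    for k in rest[1:]:
        # Remove q_m while q_{m-1}, q_m, p_k make a right turn.
        while len(hull) > 1 and cross(points[hull[-2]],
                                      points[hull[-1]],
                                      points[k]) < 0:
            removed = hull.pop()
            trail.append((removed, hull[-1], p1, k))
        hull.append(k)
    return hull, trail
```

For the five points (0,0), (4,0), (4,4), (0,4), (2,1), this sketch returns the hull [0, 1, 2, 3] and a trail with the single entry (4, 1, 0, 2): the interior point (2,1) is reported together with the triangle that contains it.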


TABLE 3 


First part of certification trail for Graham's scan

Point not on convex hull    Three surrounding points
p5                          p4, p1, [illegible]
[illegible]                 p2, p1, [illegible]
[illegible]                 p4, p1, [illegible]


Second Execution for the Convex Hull Problem

Let the certification trail consist of a set of four
tuples, (x_1,a_1,b_1,c_1), (x_2,a_2,b_2,c_2), ..., (x_r,a_r,b_r,c_r),
followed by the supposed convex hull, q_1, q_2, ..., q_m.
The code for CONVEXHULL is not used in this execution.
Indeed, the algorithm performed is dramatically different
from CONVEXHULL.

It consists of five checks on the trail data.

First, the algorithm checks for i ∈ {1, ..., r} that x_i lies
within the triangle defined by a_i, b_i, and c_i.

Second, the algorithm checks that for each triple of
counterclockwise consecutive points on the supposed
convex hull the angle formed by the points is less than
or equal to 180 degrees.

Third, it checks that there is a one to one
correspondence between the input points and the points in
{x_1, ..., x_r} ∪ {q_1, ..., q_m}.

Fourth, it checks that for i ∈ {1, ..., r}, a_i, b_i, and c_i
are among the input points.

Fifth, it checks that there is a unique point among the
points on the supposed convex hull which is a local
extreme point. A point q on the hull is a local extreme
point if its predecessor in the counterclockwise
ordering has a strictly smaller y coordinate and its
successor in the ordering has a smaller or equal y
coordinate.

If any of these checks fail then execution halts and
“error” is output. As mentioned above, the trail data
actually consists of indices into the input data. This does
not unduly complicate the checks above; instead it
makes them easier. The correctness and adequacy of
these checks must be proven.
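
The five checks can be sketched in Python as follows. This is an illustrative reading of the checks, not the patent's code: the names `cross` and `check_hull_trail` are hypothetical, the trail is assumed to be a list of 4-tuples of 0-based indices, "within the triangle" is read as strictly inside (consistent with general position), and the angle tests are expressed as sign tests on a cross product. The fourth check is performed first, and extended to the hull indices, simply so that later indexing is safe.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o, a, b (> 0 for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def check_hull_trail(points, trail, hull):
    """Return the hull if all five checks pass, else the string "error"."""
    n, m = len(points), len(hull)

    def valid(i):
        return isinstance(i, int) and 0 <= i < n

    # Fourth check: every index in the trail (and hull) is an input point.
    if not all(valid(i) for t in trail for i in t):
        return "error"
    if not all(valid(i) for i in hull):
        return "error"
    # Third check: removed points and hull points together account for
    # each input point exactly once.
    if sorted([t[0] for t in trail] + list(hull)) != list(range(n)):
        return "error"
    # First check: each removed point lies strictly inside its triangle,
    # i.e. on the same side of all three directed edges.
    for x, a, b, c in trail:
        px, pa, pb, pc = (points[i] for i in (x, a, b, c))
        d = (cross(pa, pb, px), cross(pb, pc, px), cross(pc, pa, px))
        if not (all(v > 0 for v in d) or all(v < 0 for v in d)):
            return "error"
    # Second check: consecutive hull triples never make a right turn.
    for j in range(m):
        a, b, c = (points[hull[(j + s) % m]] for s in range(3))
        if cross(a, b, c) < 0:
            return "error"
    # Fifth check: exactly one local extreme point, i.e. a point whose
    # predecessor is strictly lower and whose successor is not higher.
    ys = [points[i][1] for i in hull]
    peaks = sum(1 for j in range(m)
                if ys[j - 1] < ys[j] and ys[(j + 1) % m] <= ys[j])
    return list(hull) if peaks == 1 else "error"
```

Each check is a single pass over the trail or the hull, apart from the sort used here for the third check, which could be replaced by a counting array over the n indices to keep the whole procedure linear.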

Time Complexity of the Two Executions 

In the first execution the sorting of the input points
takes O(n log(n)) time where n is the number of input
points. One can show that this cost dominates and the
overall complexity is O(n log(n)).

Note that, unlike the minimum spanning tree example
and the Huffman tree example, the convex hull example
utilizes an algorithm in the second execution that is not
a close variant of that used in the first execution.
However, like the previous two examples, the second
execution for the convex hull problem depends
fundamentally on the information in the certification trail
for efficiency and performance.

Concurrency of Executions 

In the three examples discussed above, it is possible to 
start the second execution before the first execution has 
terminated. This is a highly desirable capability when 
additional hardware is available to run the second exe- 
cution (for example, with multiprocessor machines, or 
machines with coprocessors or hardware monitors). 

In the case of the minimum spanning tree problem,
the two executions can be run concurrently. It is only
necessary for the second execution to read the
certification trail as it is generated, one item number at a time.
Thus, there is a slight time lag in the second execution.
The case of the Huffman tree problem is similar. Both
executions can be run concurrently if the second
execution reads the certification trail as it is generated by the
first execution.

The case of the convex hull problem is not quite as
favorable, but it is still possible to partially overlap the
two executions. For example, as each 4-tuple of points is
generated by the first execution, it can be checked by
the second execution. But the second execution must
wait for the points on the convex hull to be output at the
end of the first execution before they can be checked.

An additional opportunity for overlapping execution 
occurs when the system has a dedicated comparator. In 
this case it is sometimes possible for the two executions 
to send their output to the comparator as they generate 
it. For example, this can be done in the minimum span- 
ning tree problem where the edges of the tree can be 
sent individually as they are discovered by both execu- 
tions. 
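
The overlap described in this section can be sketched in Python with two threads connected by a queue, using the Huffman example. This only illustrates the data flow and is not a description of the patented apparatus: the function names, the `END` sentinel, the `results` dictionary, and the representation of the output as a list of (new item, left item, right item) merges are all assumptions made here. Python threads also do not execute on separate processors, so the sketch shows the streaming of the trail rather than any real speedup.

```python
import heapq
import queue
import threading

END = None  # sentinel: the first execution has finished its trail


def first_execution(freqs, trail_q, results):
    """Heap-based Huffman run; streams each deletemin result."""
    n = len(freqs)
    heap = [(k, i) for i, k in enumerate(freqs, start=1)]
    heapq.heapify(heap)
    merges = []
    for new_item in range(n + 1, 2 * n):
        pair = []
        for _ in range(2):
            k, i = heapq.heappop(heap)
            trail_q.put((i, k))  # available to the checker immediately
            pair.append((i, k))
        (i1, k1), (i2, k2) = pair
        heapq.heappush(heap, (k1 + k2, new_item))
        merges.append((new_item, i1, i2))
    k, i = heapq.heappop(heap)  # final deletemin returns the root
    trail_q.put((i, k))
    trail_q.put(END)
    results["first"] = merges


def second_execution(freqs, trail_q, results):
    """Trail-driven run; waits only until the next entry arrives."""
    n = len(freqs)
    slot = [-1] * (2 * n)  # slot[i] == k means pair (i, k) is in h
    for i, k in enumerate(freqs, start=1):
        slot[i] = k
    last, merges = 0, []

    def deletemin():
        nonlocal last
        entry = trail_q.get()
        if entry is END:
            raise ValueError("trail ended early")
        i, k = entry
        if not 1 <= i < 2 * n or slot[i] != k or k < last:
            raise ValueError("bad trail entry")
        slot[i], last = -1, k
        return i, k

    try:
        for new_item in range(n + 1, 2 * n):
            (i1, k1), (i2, k2) = deletemin(), deletemin()
            slot[new_item] = k1 + k2  # insert: this slot is still free
            merges.append((new_item, i1, i2))
        deletemin()
        results["second"] = merges
    except ValueError:
        results["second"] = "error"


def run_with_certification(freqs):
    trail_q, results = queue.Queue(), {}
    workers = [threading.Thread(target=f, args=(freqs, trail_q, results))
               for f in (first_execution, second_execution)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Comparator: accept the output only if both executions agree.
    if results["first"] == results["second"]:
        return results["first"]
    return "error"
```

Because the queue is unbounded, the first execution never waits for the second; the second lags only by the time it takes each trail entry to arrive, which matches the slight time lag mentioned above.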


Comparison of Techniques 

The certification trail approach to fault tolerance,
whether implemented in hardware or software or some
combination thereof, has resemblances with other fault
tolerant techniques that have been previously proposed
and examined, but in each case there are significant and
fundamental distinctions. These distinctions are
primarily related to the generation and character of the
certification trail and the manner in which the secondary
algorithm or system uses the certification trail to
indicate whether the execution of the primary system or
algorithm was in error and/or to produce an output to
be compared with that of the primary system.

To begin, the certification trail approach might be
viewed as a form of N-version programming [Chen, L.,
and Avizienis, A., “N-version programming: a fault
tolerant approach to reliability of software operation,”
Digest of the 1978 Fault Tolerant Computing
Symposium, pp. 3-9, IEEE Computer Society Press, 1978;
Avizienis, A., and Kelly, J., “Fault tolerance by design
diversity: concepts and experiments,” Computer, vol.
17, pp. 67-80, August, 1984]. This approach specifies
that N different implementations of an algorithm be
independently executed with subsequent comparison of
the resulting N outputs. There is no relationship among
the N executions other than that they all use the same
input; each algorithm is executed independently without
any information about the execution of the other
algorithms. In marked contrast, the certification trail
approach allows [...] while executing its algorithm [...]
secondary system's execution of its algorithm. In effect,
N-version programming can be thought of relative to
the certification trail approach as the employment of a
null trail.

A software/hardware fault tolerance technique
known as the recovery block approach [Randell, [...]
“System structure for software fault tolerance,” IEEE
[...] June, 1975; Anderson, T., and Lee, [...] and
evaluation of a fault-tolerant multiprocessor using
hardware recovery blocks,” IEEE Trans. Comput., vol.
[...], pp. 115-124, February, [...]] [...] tests and
alternative procedures to produce what is to be regarded
as a correct output from a program. When using recovery
blocks, a program is viewed as being structured into
blocks of operations which after execu[...]

[...] the watchdog processor approach is that
information about system behavior is provided a priori
to the watchdog processor about the system to be
monitored; in the monitoring phase, the watchdog processor
collects or is sent information about the operation of the
system to be compared with that which was provided
during the set-up phase. On the basis of this comparison,
a decision is made by the watchdog processor as to
whether or not an error has occurred. The information
about system behavior by means of which a watchdog
processor must monitor for errors includes memory
access behavior [Namjoo, M., and McCluskey, E.,
“Watchdog processors and capability checking,” Di[...]
Computing Symposium, pp. 245-248, IEEE Computer
Society Press, 1982], control and program flow [Eifert,
J. B. and Shen, J. P., “Processor monitoring using
asynchronous signatured instruction streams,” Dig. 14th
Int. Conf. Fault-Toler[...] 20-22; Iyengar, [...] An
architecture for a [...] Annu. [...] June, 1983; [...]
Proc. 1983 Int. Test Conf., [...] microprogrammed
control units,” IEEE Trans. Comput., vol. C-34, pp.
810-821, September 1985; Kane, J. R. and Yau, S. S.,
“Concurrent software fault detection,” IEEE Trans.
Software Eng., vol. SE-1, pp. 87-99, March 1975; Lu,
D. J., “Watchdog processor and structural integrity
checking,” IEEE Trans. Comput., vol. C-31, pp. 681-685,
July 1982; Namjoo, M., “Techniques for concurrent
testing of VLSI processor operation,” [...] Sridhar, T.
and Thatte, S. M., “Concurrent checking of program
flow in VLSI processors,” Dig. 1982 Int. Test Conf.,
pp. 191-199, November, 1982; 46, 47], or reason[...]
Conf. Circuits, Syst., Comput., pp. 641-645, 1978,
November 6-8; Andrews, D., “Using executable
assertions for testing and fault tolerance,” Dig. 9th Annu.
Int. Symp. Fault-Tolerant Comput., pp. 102-105, 1979,
June 20-22; Mahmood, A., Lu, D. J., and McCluskey,
E. J., “Concurrent fault detection using a watchdog
processor and assertions,” Proc. 1983 Int. Test Conf.,
pp. 622-628, October 1983]. An assertion
can be defined as an invariant relationship among
variables of a process. In a program, for example,
assertions can be written as logical statements and can be
inserted into the code to signify that which has been
predetermined to be invariably true at that point in the
execution of the program. Assertions are based on a
priori determined properties of the primary system or
algorithm. This, however, again serves to distinguish
the executable assertion technique from the use of
certification trails in that a certification trail is a key to the
solution of a problem or the execution of an algorithm
that can be utilized to efficiently and correctly produce
the solution.

Algorithm-based fault tolerance [Huang, K.-H., and
Abraham, J., “Algorithm-based fault tolerance for
matrix operations,” IEEE Trans. on Computers, pp.
518-529, vol. C-33, June, 1984; Nair, V., and Abraham,
J., “General linear codes for fault-tolerant matrix
operations on processor arrays,” Dig. of the 1988 Fault
Tolerant Computing Symposium, pp. 180-185, June, 1988;
“Fault tolerant FFT networks,” Dig. of the 1985 Fault
Tolerant Computing Symposium, June, 1985] uses error
detecting and correcting codes for performing reliable
computations with specific algorithms. This technique
encodes data at a high level and algorithms are
specifically designed or modified to operate on encoded data
and produce encoded output data. Algorithm-based
fault tolerance is distinguished from other fault
tolerance techniques by three characteristics: the encoding
of the data used by the algorithm; the modification of
the algorithm to operate on the encoded data; and the
distribution of the computation steps in the algorithm
among computational units. It is assumed that at most
one computational unit is faulty during a specified time
period. The error detection capabilities of the
algorithm-based fault tolerance approach are directly
related to that of the error correction encoding utilized.
The certification trail approach does not require that
the data to be executed be modified nor that the
fundamental operations of the algorithm be changed to
account for these modifications. Instead, only a trail
indicative of aspects of the algorithm's operations must be
generated by the algorithm. As seen from the above
examples, the production of this trail does not burden
the algorithm with a significant overhead. Moreover,
any combination of computational errors can be
handled.

Recently Blum and Kannan [Blum, M., and Kannan,
S., “Designing programs that check their work,”
Proceedings of the 1989 ACM Symposium on Theory of
Computing, pp. 86-97, ACM Press, 1989] have defined
what they call a program checker. A program checker
is an algorithm which checks the output of another
algorithm for correctness and thus it is similar to an
acceptance test in a recovery block. An example of a
program checker is the algorithm developed by Tarjan
[Tarjan, R. E., “Applications of path compression on
balanced trees,” J. ACM, pp. 690-715, October, 1979]
which takes as input a graph and a supposed minimum
spanning tree and indicates whether or not the tree
actually is a minimum spanning tree. The Blum and
Kannan checker is actually more general than this
because it is allowed to be probabilistic in a carefully
specified way. There are two main differences between
this approach and the certification trail approach. First,
a program checker may call the algorithm it is checking
a polynomial number of times. In the certification trail
approach the algorithm being checked is run once.
Second, the checker is designed to work for a problem
and not a specific algorithm. That is, the checker design
is based on the input/output specification of a problem.
The certification trail approach is explicitly algorithm
oriented. In other words, a specific algorithm for a
problem is modified to output a certification trail. This
trail sometimes allows the second execution to be faster
than any known program checkers for the problem. This
is the case for the minimum spanning tree problem.

Other hardware and software fault tolerance and
error monitoring techniques have been proposed and
studied that might be thought of as bearing some
resemblance to the certification trail approach. Extensive
summaries and descriptions of these techniques can be
found in the literature [Siewiorek, D., and Swarz, R.,
The theory and practice of reliable design, Digital
Press, Bedford, Mass., 1982; Avizienis, A., “Fault
tolerance by means of external monitoring of computer
systems,” Proceedings of the 1981 National Computer
Conference, pp. 27-40, AFIPS Press, 1980; Johnson, B.,
Design and analysis of fault tolerant digital systems,
Addison-Wesley, Reading, Mass., 1989; Mahmood, A.,
and McCluskey, E., “Concurrent error detection using
watchdog processors - a survey,” IEEE Trans. on
Computers, vol. 37, pp. 160-174, February, 1988].
Examination of these techniques reveals, however, that in
each case there are fundamental distinctions from the
certification trail approach. In summary, the
certification trail approach stands alone in its employment of
secondary algorithms/systems for the computation of
an output for comparison that because of the availability
of the trail not only proceeds in a more efficient manner
than that of the primary but also can indicate whether
the execution of the primary algorithm was correct.

Although the invention has been described in detail in
the foregoing embodiments for the purpose of
illustration, it is to be understood that such detail is solely for
that purpose and that variations can be made therein by
those skilled in the art without departing from the spirit
and scope of the invention except as it may be described
by the following claims.

What is claimed is:

1. A method for achieving fault tolerance in a
computer system having at least a first central processing
unit and a second central processing unit comprising the
steps of:

executing a first algorithm in the first central
processing unit on input so that a first output and a
certification trail are produced;

executing a second algorithm in the second central
processing unit on the input and on the certification
trail so that a second output is produced, said
second algorithm having a faster execution time than
the first algorithm for a given input; and

comparing the first and second outputs such that an
error result is produced if the first and second
outputs are not the same.



2. A method as described in claim 1 wherein the step
of executing the second algorithm includes the step of 
determining whether the certification trail is in error. 

3. A method as described in claim 2 including before 
the step of executing the first algorithm, there is the step 
of duplicating the input such that the input that is pro- 
vided to the step of executing the first algorithm is also 
the input that is provided to the step of executing the 
second algorithm. 

4. A method as described in claim 3 wherein the step 
of executing the first algorithm includes the step of 
determining whether the first output is in error. 

5. A method as described in claim 4 wherein the step 
of executing the first algorithm includes the step of 
determining whether the second output is in error. 

6. A method as described in claim 5 wherein the
second algorithm generates the second output correctly
when the second algorithm is executed by the second
processing unit even if the certification trail produced
by the first algorithm when the first algorithm is
executed by the first processing unit is incorrect.

7. A method as described in claim 1 wherein the 
second algorithm is derived from the first algorithm. 

8. A computer system comprising: 
a first computer comprising, 
a first memory, 

a first central processing unit in communication with 
the memory, 

a first input port in communication with the memory 
and the first central processing unit, 
a first algorithm disposed in the first memory, said 
first algorithm produces a first output and produces 
a certification trail based on input received by the 
input port when the first algorithm is executed by 
the first central processor; 
a second computer comprising a second memory, 
a second central processing unit in communication 
with the second memory and the first central pro- 
cessing unit; 

a second input port in communication with the sec- 
ond memory and the second central processing 
unit; 

a second algorithm disposed in the second memory, 
said second algorithm produces a second output 
based on the input and the certification trail when 
the second algorithm is executed by the second 
central processing unit, said second algorithm having 
a faster execution time than the first algorithm 
for a given input; and 

a mechanism for comparing the first and second outputs 
such that an error result is produced if the first 
and second outputs are not the same. 

9. A computer as described in claim 8 wherein the 
second algorithm generates the second output correctly 
when the second algorithm is executed by the second 
processing unit even if the certification trail produced 
by the first algorithm when the first algorithm is executed 
by the first processing unit is incorrect. 

10. A computer system as described in claim 9 
wherein the mechanism for comparing is a comparator. 

11. An apparatus as described in claim 10 wherein the 
second algorithm is derived from the first algorithm. 

12. A method for achieving fault tolerance in a cen- 
tral processing unit comprising the steps of: 
executing a first algorithm in the central processing 
unit on input so that a first output and a certification 
trail are produced; 

executing a second algorithm in the central process- 
ing unit on the input and on the certification trail so 
that a second output is produced, said second algorithm 
having a faster execution time than the first 
algorithm for a given input; and 
comparing the first and second outputs such that an 
error result is produced if the first and second out- 
puts are not the same. 
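
By way of illustration only, the steps recited in claim 12 may be sketched in Python for a sorting problem. The choice of sorting, the use of the sorting permutation as the certification trail, and the names `first_algorithm`, `second_algorithm` and `run_with_certification_trail` are assumptions made for this sketch and are not taken from the claims. The second algorithm here runs in linear time given the trail, and it returns `None` as an error indication when the trail is found to be invalid.

```python
def first_algorithm(values):
    """Sort the input in O(n log n) and emit a certification trail.

    The trail (an assumption of this sketch) is the list of input
    indices in sorted order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    output = [values[i] for i in order]
    return output, order


def second_algorithm(values, trail):
    """Rebuild the sorted output in O(n) from the input and the trail.

    The trail is not trusted: it must be a permutation of the input
    indices, and the values it selects must be in nondecreasing order.
    Returns None (an error indication) if either check fails.
    """
    n = len(values)
    if len(trail) != n:
        return None
    seen = [False] * n
    output = []
    for i in trail:
        # Reject out-of-range or repeated indices.
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return None
        seen[i] = True
        # Reject a trail that does not yield sorted order.
        if output and values[i] < output[-1]:
            return None
        output.append(values[i])
    return output


def run_with_certification_trail(values):
    """Execute both algorithms and compare their outputs."""
    first_output, trail = first_algorithm(values)
    second_output = second_algorithm(values, trail)
    if second_output is None or first_output != second_output:
        raise RuntimeError("error result: outputs differ or trail invalid")
    return first_output
```

In this sketch a corrupted trail such as `[0, 1, 2]` for the input `[3, 1, 2]` is rejected by the second algorithm rather than producing a wrong output, which is one possible reading of the behavior recited in claim 13.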

13. A method as described in claim 12 wherein the 
second algorithm generates the second output correctly 
when the second algorithm is executed by the process- 
ing unit even if the certification trail produced by the 
first algorithm when it is executed by the processing 
unit is incorrect. 

14. A method as described in claim 13 wherein the 
second algorithm is derived from the first algorithm. 

15. A computer comprising: 
a memory, 

a central processing unit in communication with the 
memory, 

a first input port in communication with the memory 
and the central processing unit, 
a first algorithm disposed in the memory, said first 
algorithm produces a first output and a certification 
trail based on input received by the input port 
when the input is executed by the central process- 
ing unit; 

a second algorithm disposed in the memory, said 
second algorithm produces a second output based 
on the input and on at least a portion of the certifi- 
cation trail when the second algorithm is executed 
by the central processing unit, said second algo- 
rithm having a faster execution time than the first 
algorithm for a given input; and 

a mechanism for comparing the first and second out- 
puts such that an error result is produced if the first 
and second outputs are not the same. 

16. A computer as described in claim 15 wherein the 
second algorithm generates the second output correctly 
when the second algorithm is executed by the process- 
ing unit even if the certification trail produced by the 
first algorithm when the first algorithm is executed by 
the processing unit is incorrect. 

17. A computer as described in claim 16 wherein the 
mechanism for comparing is a comparator. 

18. An apparatus as described in claim 15 wherein the 
second algorithm is derived from the first algorithm. 

* * * * * 




